Abstract: Reconstruction of a function from noisy data is
often formulated as a regularized optimization problem over an 
infinite-dimensional reproducing kernel Hilbert space (RKHS). 
The solution describes the observed data and also has a small 
RKHS norm. When the data fit is measured using a quadratic 
loss, this estimator has a known statistical interpretation in terms 
of Gaussian random fields: it provides the minimum variance 
estimate of the unknown function given the noisy measurements. 
In this paper, we provide a statistical interpretation when more 
general losses are used, such as Vapnik or Huber. For a given 
data set, the RKHS function estimate contains all the possible 
maximum a posteriori estimates of a random field for which the 
prior distribution is Gaussian. 

Index Terms: kernel based regularization; Gaussian processes;
representer theorem; reproducing kernel Hilbert spaces;
regularization networks; support vector regression



I. Introduction 

Minimizing a regularized functional with respect to a reproducing
kernel Hilbert space (RKHS) is a popular approach to reconstruct a
function $F : \mathcal{X} \to \mathbb{R}$ from noisy data; e.g.
see [2], [17], [19], [6]. To be specific, regularization in the RKHS
$\mathcal{H}$ estimates $F$ using $\hat{F}$ defined by

$$\hat{F} = \arg\min_{F \in \mathcal{H}} \sum_{i=1}^{N} V_i[y_i - F(x_i)] + \gamma \|F\|_{\mathcal{H}}^2 \qquad (1)$$

where $\gamma \in \mathbb{R}_+$ is the regularization parameter,
$\mathcal{X}$ is a set (possibly finite), $x_i \in \mathcal{X}$ is the
location where $y_i$ is measured, $V_i : \mathbb{R} \to \mathbb{R}_+$
is the loss function for $y_i$, and $\|\cdot\|_{\mathcal{H}}$ is the
RKHS norm induced by the positive definite reproducing kernel
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, see [2]. Notice
that $x_i \in \mathcal{X}$ and is not a component of a vector $x$, as
is the convention for other subscripts in this paper (unless stated
otherwise; e.g., $V_i$ is a function, not a value in $\mathbb{R}$).

One of the important features of the above approach is that,
even if the dimension of $\mathcal{H}$ is infinite, the solution
belongs to a finite-dimensional subspace. In fact, under mild
assumptions on the loss, according to the representer theorem
[EOl . lUFl, $\hat{F}$ in (1) is the sum of kernel sections
$K_i : \mathcal{X} \to \mathbb{R}$ defined by $K_i(x) = K(x_i, x)$.
To be specific,

$$\hat{F}(x) = \sum_{i=1}^{N} \hat{c}_i K_i(x) \qquad (2)$$

where $\hat{c}$ is defined by

$$\hat{c} = \arg\min_{c \in \mathbb{R}^N} \sum_{i=1}^{N} V_i\Big[y_i - \sum_{j=1}^{N} K_{ij} c_j\Big] + \gamma\, c^T K c . \qquad (3)$$

Here and below, $K \in \mathbb{R}^{N \times N}$ is the so-called kernel
matrix, or Gram matrix, defined by $K_{ij} = K(x_i, x_j)$. When the
component loss functions $V_i(\cdot)$ are quadratic, the problem in (1)
admits the structure of a regularization network [13] and also has a
statistical interpretation: $\hat{F}$ is the minimum variance estimate
of a Gaussian random field, given measurements corrupted by
Gaussian white noise and a prior covariance proportional to
$K$. This connection, briefly reviewed in Section III, is well
known in the literature and was initially studied in [10] in the
context of spline regression, see also [19], [7], [15]. It can be
proved using the representer theorem together with the closed
form solution for the coefficients $\hat{c}_i$ in (2):

$$\hat{c} = (K + \gamma I_N)^{-1} y \qquad (4)$$

where $y \in \mathbb{R}^N$ is the vector of measurements $y_i$ and
$I_N$ is the $N \times N$ identity matrix.
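
The closed form (4) and the kernel expansion (2) can be sketched numerically as follows. This is only an illustrative sketch: the Gaussian kernel, its width, the synthetic data, and the function names are assumptions introduced here, not choices made in the paper. The last lines evaluate the gradient of the quadratic-loss objective in the coefficients, which should vanish at $\hat{c}$.

```python
import numpy as np

def gaussian_kernel(a, b, width=0.5):
    """Kernel matrix with entries K(a_i, b_j); a hypothetical choice of K."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / width) ** 2)

def fit_regularization_network(x, y, gamma, kernel=gaussian_kernel):
    """Return c_hat = (K + gamma I_N)^{-1} y, as in (4)."""
    K = kernel(x, x)
    return np.linalg.solve(K + gamma * np.eye(len(x)), y)

def predict(x_new, x, c_hat, kernel=gaussian_kernel):
    """Evaluate F_hat(x_new) = sum_i c_hat_i K(x_i, x_new), as in (2)."""
    return kernel(x_new, x) @ c_hat

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
gamma = 0.1

c_hat = fit_regularization_network(x, y, gamma)
K = gaussian_kernel(x, x)

# Gradient of sum_i (y_i - (Kc)_i)^2 + gamma c^T K c with respect to c,
# evaluated at c_hat; it should be numerically zero.
grad = -2.0 * K @ (y - K @ c_hat) + 2.0 * gamma * K @ c_hat
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience, not something prescribed by the text.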

A formal statistical model for more general loss functions
(e.g., the Vapnik ε-insensitive loss used in support vector
regression lITSl, [12], [18]) is missing from the literature. After
interpreting the $V_i$ as alternative statistical models for the
observation noise, many papers argue that $\hat{F}$ in (1) can be
viewed as a maximum a posteriori (MAP) estimator, assuming an a
priori probability density for $F$ proportional to the exponential of
a negative multiple of $\|F\|_{\mathcal{H}}^2$; e.g. see Section 7 in
[5]. These kinds of statements are informal, since in an
infinite-dimensional function space the concept of probability
density is not well defined; see e.g. [3] for a thorough treatment
of Gaussian measures. The main contribution of this note is to
provide a statistical model that justifies $\hat{F}$ as an estimate of
a Gaussian random field.

The structure of the paper is as follows. In Section II we
formulate the statistical model. In Section III we review the
connection between regularized estimation in RKHS and estimation
in the quadratic case, and then extend this connection
to more general losses. Section IV contains a summary and
conclusion. The proofs are presented in Section V.



II. Statistical Model 

Here and below, $E[\cdot]$ indicates the expectation operator and,
given (column) random vectors $u$ and $v$, we define

$$\mathrm{cov}[u, v] = E\big[(u - E[u])(v - E[v])^T\big] .$$

We assume that the measurements $y_i$ are obtained by sampling
the function $F$ in the presence of additive noise, i.e.

$$y_i = F(x_i) + e_i , \quad i = 1, \ldots, N , \qquad (5)$$

where each $x_i$ is a known sampling location. We make the
following assumptions:

Assumption 1: We are given a known positive definite
autocovariance function $K$ on $\mathcal{X}$ and a scalar $\lambda > 0$
such that for any sequence of points $\{x_j : j = 1, \ldots, J\}$, the
vector $f = [F(x_1), \ldots, F(x_J)]$ is a Gaussian random variable
with mean zero and covariance given by

$$\mathrm{cov}(f_j, f_k) = \lambda K(x_j, x_k) . \qquad \blacksquare$$

A random function $F$ that satisfies Assumption 1 is often
referred to as a zero-mean Gaussian random field on $\mathcal{X}$.

Assumption 2: We are given a sequence of measurement
pairs $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$ and corresponding
loss functions $V_i$ for $i = 1, \ldots, N$. In addition, we are given a
scalar $\sigma > 0$ such that

$$p(y \mid F) \propto \exp\left(-\sum_{i=1}^{N} \frac{V_i[y_i - F(x_i)]}{2\sigma^2}\right) .$$

Furthermore, the measurement noise random variables
$e_i = y_i - F(x_i)$ are independent of the random function $F$. ■

For example, $V_i(r) = r^2$ corresponds to Gaussian noise,
while using $V_i(r) = |r|$ corresponds to Laplacian noise. The
statistical interpretation of an ε-insensitive $V_i$ in terms of
Gaussians with mean and variance described by suitable random
variables can be found in [14].
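
As a small illustration of how a loss induces a noise model under Assumption 2, the sketch below evaluates the unnormalized noise density $\exp(-V(r)/(2\sigma^2))$ for several losses. The function names, the default parameters `eps` and `k`, and the particular Huber parameterization (scaled to match $r^2$ near zero) are assumptions made for this example only.

```python
import math

def quadratic(r):
    """V(r) = r^2, corresponding to Gaussian noise."""
    return r * r

def absolute(r):
    """V(r) = |r|, corresponding to Laplacian noise."""
    return abs(r)

def vapnik(r, eps=0.1):
    """Vapnik eps-insensitive loss; eps is a hypothetical default."""
    return max(0.0, abs(r) - eps)

def huber(r, k=1.0):
    """Huber-type loss: r^2 for |r| <= k, linear growth beyond k (one common form)."""
    a = abs(r)
    return a * a if a <= k else 2.0 * k * a - k * k

def unnormalized_noise_density(loss, r, sigma=1.0):
    """exp(-V(r) / (2 sigma^2)), the single-measurement factor in Assumption 2."""
    return math.exp(-loss(r) / (2.0 * sigma ** 2))

def gaussian_pdf(r, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-r * r / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# For the quadratic loss the ratio to the Gaussian density is the same constant
# at every residual, namely the normalizing factor sigma * sqrt(2 pi).
sigma = 0.7
ratio_a = unnormalized_noise_density(quadratic, 0.3, sigma) / gaussian_pdf(0.3, sigma)
ratio_b = unnormalized_noise_density(quadratic, -1.2, sigma) / gaussian_pdf(-1.2, sigma)
```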



III. Maximum a Posteriori Estimation and Reproducing Kernel Hilbert Spaces

A. Gaussian measurement noise 

We first consider the case of Gaussian measurement noise;
i.e., $V_i(r) = r^2$. This corresponds to modeling the $\{e_i\}$ as
i.i.d. Gaussian random variables with variance $\sigma^2$. In view of
the independence of $F$ and $e$, it turns out that $F(x)$ and $y$ are
jointly Gaussian for any $x \in \mathcal{X}$. Hence, the posterior
$p[F(x) \mid y]$ is also Gaussian. The mean and variance for this
posterior can be calculated using the following proposition
[1, Example 3.6].

Proposition 3: Suppose $u$ and $v$ are jointly Gaussian random
vectors. Then, $p(u \mid v)$ is also Gaussian with mean and
autocovariance given by

$$E(u \mid v) = E(u) + \mathrm{cov}(u, v)\,\mathrm{cov}(v, v)^{-1}[v - E(v)] ,$$
$$\mathrm{cov}(u, u \mid v) = \mathrm{cov}(u, u) - \mathrm{cov}(u, v)\,\mathrm{cov}(v, v)^{-1}\mathrm{cov}(v, u) .$$

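Suppose Assumptions 1 and 2 hold and $V_i(r) = r^2$ for
$i = 1, \ldots, N$. It follows that $y$ is Gaussian. Applying
Proposition 3 with $u = F(x)$ and $v = y$, we obtain $E(u) = 0$,
$E(v) = 0$, and

$$E[F(x) \mid y] = \lambda\,[K_1(x) \;\ldots\; K_N(x)]\,(\lambda K + \sigma^2 I_N)^{-1} y .$$

Using the notation $\gamma = \sigma^2/\lambda$, one obtains

$$E[F(x) \mid y] = [K_1(x) \;\ldots\; K_N(x)]\,(K + \gamma I_N)^{-1} y = \sum_{i=1}^{N} \hat{c}_i K_i(x) ,$$

where $\hat{c}$ is computed using (4). This expression shows that the
minimum variance estimate coincides with $\hat{F}$ defined by (1).
This result is formalized in the following proposition.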

Proposition 4: Suppose that $F$ satisfies Assumption 1 and
$p(y \mid F)$ satisfies Assumption 2 with $V_i(r) = r^2$. Then, the
minimum variance estimate of $F(x)$ given $y$ is $\hat{F}(x)$ defined
by (1), with $\gamma = \sigma^2/\lambda$ and $\mathcal{H}$ the RKHS
induced by $K$. ■
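
Proposition 4 can be checked numerically on a toy problem. The sketch below is illustrative only: the kernel $\exp(-|a-b|)$, the values of `lam` and `sigma`, and the data are made-up assumptions. It computes the posterior mean through the conditional-mean formula of Proposition 3 and compares it with the regularization network built from (4) and (2) with $\gamma = \sigma^2/\lambda$.

```python
import numpy as np

def kernel(a, b):
    """exp(-|a - b|), a positive definite kernel; a hypothetical choice of K."""
    return np.exp(-np.abs(a[:, None] - b[None, :]))

lam, sigma = 2.0, 0.5                   # prior scale lambda, noise std sigma (made up)
gamma = sigma ** 2 / lam

x = np.array([0.0, 0.3, 0.7, 1.0])      # sampling locations x_i
y = np.array([0.1, 0.8, -0.4, 0.2])     # measurements y_i (made up)
x_new = np.array([0.5, 1.5])            # points where F is estimated

K = kernel(x, x)
k_new = kernel(x_new, x)                # each row is [K_1(x) ... K_N(x)]

# Conditional mean with u = F(x_new), v = y:
# cov(u, v) = lam * k_new and cov(v, v) = lam * K + sigma^2 I_N.
cov_vv = lam * K + sigma ** 2 * np.eye(x.size)
posterior_mean = lam * k_new @ np.linalg.solve(cov_vv, y)

# Regularization network: c_hat from (4), then the kernel expansion (2).
c_hat = np.linalg.solve(K + gamma * np.eye(x.size), y)
f_hat = k_new @ c_hat
```

Under these assumptions `posterior_mean` and `f_hat` agree up to floating-point error.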

B. More general measurement noise 

We now consider what happens when the Gaussian assumptions
on $e_i$ are removed. If the probability density function for
$F$ were well defined and given by

$$p(F) \propto \exp\left(-\frac{\|F\|_{\mathcal{H}}^2}{2\lambda}\right) ,$$

then the posterior density conditional on the data would be

$$p(F \mid y) \propto \exp\left(-\sum_{i=1}^{N} \frac{V_i[y_i - F(x_i)]}{2\sigma^2} - \frac{\|F\|_{\mathcal{H}}^2}{2\lambda}\right) .$$

In this case, the negative log of $p(F \mid y)$ would be proportional
to the objective in (1). Hence, one could immediately conclude
that $\hat{F}$ is the MAP estimator. Unfortunately, the posterior
density of $F$ on a function space is not well defined, and this
explains the somewhat indirect formulation of Proposition 5
below. Note that this proposition refers to the MAP estimator
for $f$, a finitely sampled version of $F$.

Proposition 5: Suppose that $F$ satisfies Assumption 1 and
$p(y \mid F)$ satisfies Assumption 2. Let $M$ be any non-negative
integer, $\{x_i : i = N+1, \ldots, N+M\}$ be an arbitrary set of
points in $\mathcal{X}$, and define

$$f = [F(x_1), \ldots, F(x_{N+M})]^T .$$

Then the MAP estimate for $f$ given $y$ is

$$\arg\max_f \; p(y \mid f)\, p(f) = [\hat{F}(x_1), \ldots, \hat{F}(x_{N+M})]^T ,$$

where $\hat{F}$ is defined by (1), with $\gamma = \sigma^2/\lambda$ and
$\mathcal{H}$ the RKHS induced by $K$. ■
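
A numerical sketch of Proposition 5 for a non-quadratic loss is given below. Everything specific in it is an assumption made for illustration: the kernel $\exp(-|a-b|)$, the Huber-type loss and its threshold `k`, the data, and the use of plain gradient descent with a step size derived from a curvature bound. It solves (3) for $\hat{c}$, evaluates (2) at all $N+M$ points, and compares the result with a direct maximization of $p(y \mid f)\,p(f)$ over $f \in \mathbb{R}^{N+M}$.

```python
import numpy as np

def kernel(a, b):
    """exp(-|a - b|), a positive definite kernel; a hypothetical choice of K."""
    return np.exp(-np.abs(a[:, None] - b[None, :]))

def huber_grad(r, k=0.3):
    """Derivative of V(r) = r^2 for |r| <= k and 2k|r| - k^2 otherwise."""
    return 2.0 * np.clip(r, -k, k)

def gradient_descent(grad, z0, step, iters=20000):
    """Fixed-step gradient descent; adequate for these small convex problems."""
    z = z0.copy()
    for _ in range(iters):
        z -= step * grad(z)
    return z

lam, sigma = 1.0, 0.5
gamma = sigma ** 2 / lam
x = np.array([0.0, 0.3, 0.6, 0.9])       # N = 4 sampling locations
y = np.array([0.2, 1.5, -0.1, 0.4])      # made-up data with one large value
x_extra = np.array([0.45, 1.2])          # M = 2 additional points
x_all = np.concatenate([x, x_extra])
N = x.size

# (a) RKHS estimate: minimize (3) over c, then evaluate (2) at all N + M points.
K = kernel(x, x)

def grad_c(c):
    return -K @ huber_grad(y - K @ c) + 2.0 * gamma * K @ c

# Curvature bound: Hessian <= 2 K^2 + 2 gamma K.
L_c = 2.0 * np.linalg.norm(K, 2) ** 2 + 2.0 * gamma * np.linalg.norm(K, 2)
c_hat = gradient_descent(grad_c, np.zeros(N), 1.0 / L_c)
f_rkhs = kernel(x_all, x) @ c_hat

# (b) MAP estimate of f: minimize
#     sum_i V(y_i - f_i) / (2 sigma^2) + f^T (lam K_all)^{-1} f / 2.
P = np.linalg.inv(lam * kernel(x_all, x_all))

def grad_f(f):
    g = P @ f
    g[:N] -= huber_grad(y - f[:N]) / (2.0 * sigma ** 2)
    return g

# Curvature bound: Hessian <= I / sigma^2 + P.
L_f = 1.0 / sigma ** 2 + np.linalg.norm(P, 2)
f_map = gradient_descent(grad_f, np.zeros(N + x_extra.size), 1.0 / L_f)
```

Under these assumptions `f_map` and `f_rkhs` agree to within the optimization tolerance, including at the two points where no measurement is available.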
Remark 6: When considering non-Gaussian loss functions,
the minimum variance and MAP estimates are different. Consider
e.g. the case where $N = 1$, $M = 0$, $V_1(r) = |r|$, $y = 1$, and
$\lambda = 1$, $\sigma = 1$, $K(x_1, x_1) = 1$. For this case,
$f = F(x_1)$, and the MAP estimate for $f$ given $y$ is

$$\hat{f} = \arg\min_f \left(f^2 + |1 - f|\right) = 1/2 . \qquad (6)$$

Define $A$ by

$$A = \int_{-\infty}^{+\infty} \exp(-f^2 - |1 - f|)\, df .$$

The minimum variance estimate minus the MAP estimate is

$$E(f \mid y) - \hat{f} = \frac{\exp(-3/4)}{A} \int_{1/2}^{+\infty} s\, \frac{\exp(1 - 2s) - 1}{\exp(s^2)}\, ds . \qquad (7)$$

It follows that $A > 0$ and that, for $s > 1/2$, the integrand above
is negative. Hence the right-hand side is negative, and
$E(f \mid y) < \hat{f}$. ■
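
The claim of Remark 6 can be checked by numerical quadrature. The sketch below is illustrative only: the midpoint rule, the truncation of the integrals to a finite interval, and the grid search for the minimizer are choices made here, not part of the paper.

```python
import math

def density(f):
    """Unnormalized posterior exp(-f^2 - |1 - f|) used in Remark 6."""
    return math.exp(-f * f - abs(1.0 - f))

def integrate(func, a, b, n=100000):
    """Composite midpoint rule on [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# The density is negligible outside [-10, 10], so the truncation is harmless.
A = integrate(density, -10.0, 10.0)
posterior_mean = integrate(lambda f: f * density(f), -10.0, 10.0) / A

# MAP estimate: grid search for the minimizer of f^2 + |1 - f|.
grid = [i / 1000.0 for i in range(-2000, 2001)]
f_map = min(grid, key=lambda f: f * f + abs(1.0 - f))

# Right-hand side of (7), truncated at s = 10.
rhs = math.exp(-0.75) / A * integrate(
    lambda s: s * (math.exp(1.0 - 2.0 * s) - 1.0) / math.exp(s * s), 0.5, 10.0)
```

Here `posterior_mean - f_map` should be negative and should match `rhs` up to quadrature error.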

IV. Conclusion 

When the RKHS induced by $K$ is infinite-dimensional, the
realizations of the Gaussian random field with autocovariance
$K$ do not fall in $\mathcal{H}$ with probability one, see eq. 34 in [TJJ
and also [9], E), ifTTI for generalizations. A simple heuristic
argument illustrating this fact can also be found in Chapter
1 of [19]. The intuition here is that the realizations of $F$ are
much less regular than functions in the RKHS whose kernel
is equal to the autocovariance $K$. On the other hand, in the
case of Gaussian measurement noise, $\hat{F}$ defined in (1) is the
minimum variance estimate; see Proposition 4. In this note we
proved a formal connection between Bayesian estimation and
the more general case prescribed by Assumption 2. For a given
data set $\{(x_i, y_i)\}$, the RKHS function estimate $\hat{F}$ contains
all the possible maximum a posteriori estimates of a random field
for which the prior distribution is Gaussian. This result extends
to more general cases by using more general versions of the
Representer Theorem; see (|2).






V. Appendix 



A. Lemmas 



We begin the appendix with two lemmas which are instrumental
in proving Proposition 5.

Lemma 7: Suppose that $g$ and $h$ are jointly Gaussian random
vectors. It follows that

$$\max_h \log p(h \mid g) = -\log\det\{2\pi\,[\mathrm{cov}(h,h) - \mathrm{cov}(h,g)\,\mathrm{cov}(g,g)^{-1}\mathrm{cov}(g,h)]\}/2 ,$$

and this maximum does not depend on the value of $g$.

Proof: The proof comes from well known properties of
joint Gaussian vectors, see e.g. [1]. The conditional density
$p(h \mid g)$ is Gaussian and is given by

$$-2\log p(h \mid g) = \log\det[2\pi\,\mathrm{cov}(h,h \mid g)] + [h - E(h \mid g)]^T \mathrm{cov}(h,h \mid g)^{-1} [h - E(h \mid g)] ,$$

where, recalling also Proposition 3,

$$\mathrm{cov}(h,h \mid g) = \mathrm{cov}(h,h) - \mathrm{cov}(h,g)\,\mathrm{cov}(g,g)^{-1}\mathrm{cov}(g,h) .$$

Thus, $\mathrm{cov}(h,h \mid g)$ does not depend on the value of $g$ (and
it would not make sense for it to depend on the value of $h$).
Hence, one has

$$\arg\max_h p(h \mid g) = E(h \mid g) , \qquad \max_h \log p(h \mid g) = -\log\det[2\pi\,\mathrm{cov}(h,h \mid g)]/2 .$$

This equation, and the representation for $\mathrm{cov}(h,h \mid g)$ above,
complete the proof of this lemma. ■

Lemma 8: Assume that $g$ and $h$ are jointly Gaussian random
vectors and that $y$ is a random vector such that
$p(y \mid g, h) = p(y \mid g)$, and suppose we are given a value for $y$.
Define the corresponding estimates for $g$ and $h$ by

$$(\hat{g}, \hat{h}) = \arg\max_{g,h}\; p(y, g, h) ,$$

and assume the above maximizers are unique. It follows that

$$\hat{g} = \arg\max_g\; p(y \mid g)\, p(g) , \qquad (8)$$
$$\hat{h} = \arg\max_h\; p(h \mid g = \hat{g}) . \qquad (9)$$

Proof: We have

$$p(y, g, h) = p(y \mid g, h)\, p(h \mid g)\, p(g) = p(y \mid g)\, p(g)\, p(h \mid g) ,$$
$$\max_{g,h}\; p(y, g, h) = \max_g \Big\{ [p(y \mid g)\, p(g)] \max_h p(h \mid g) \Big\} .$$

It follows from Lemma 7 that $\max_h p(h \mid g)$ is constant with
respect to $g$. Hence

$$\hat{g} = \arg\max_g\; [p(y \mid g)\, p(g)] ,$$

which completes the proof of (8). Hence

$$\max_{g,h}\; p(y, g, h) = p(y \mid \hat{g})\, p(\hat{g}) \max_h p(h \mid g = \hat{g}) ,$$

which completes the proof of (9). ■

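B. Proof of Proposition 5

Proof: The kernel matrix $K$ is positive definite and hence
invertible (Assumption 1). Define the random vectors $g$ and $h$
by

$$g = [F(x_1), \ldots, F(x_N)]^T , \qquad h = [F(x_{N+1}), \ldots, F(x_{N+M})]^T .$$

It follows that $f$ in Proposition 5 is given by $f = (g^T, h^T)^T$.
Notice that $p(y \mid f) = p(y \mid g)$ and that Lemma 8 can be applied.
From (8) and the hypotheses above, we obtain

$$\hat{g} = \arg\max_g\; p(y \mid g)\, p(g) = \arg\max_g \left( -\sum_{i=1}^{N} \frac{V_i[y_i - g_i]}{2\sigma^2} - \frac{g^T K^{-1} g}{2\lambda} \right) .$$

Using the representation $g = Kc$ we obtain

$$\hat{c} = \arg\max_c \left( -\sum_{i=1}^{N} \frac{V_i\big[y_i - \sum_{j=1}^{N} K_{ij} c_j\big]}{2\sigma^2} - \frac{c^T K c}{2\lambda} \right) .$$

This agrees with (3), because $\gamma = \sigma^2/\lambda$, and thereby shows

$$\hat{g} = [\hat{F}(x_1), \ldots, \hat{F}(x_N)]^T .$$

Finally, by Proposition 3 and Lemma 7 in conjunction with (2),
(9), and the expression for $\hat{g}$ above, we obtain

$$\begin{aligned}
\hat{h} &= \mathrm{cov}(h, g)\,\mathrm{cov}(g, g)^{-1}\hat{g} \\
&= \mathrm{cov}(h, g)\,(\lambda K)^{-1}(K\hat{c}) \\
&= \begin{pmatrix} K_1(x_{N+1}) & \cdots & K_N(x_{N+1}) \\ \vdots & & \vdots \\ K_1(x_{N+M}) & \cdots & K_N(x_{N+M}) \end{pmatrix} \begin{pmatrix} \hat{c}_1 \\ \vdots \\ \hat{c}_N \end{pmatrix} \\
&= [\hat{F}(x_{N+1}), \ldots, \hat{F}(x_{N+M})]^T .
\end{aligned}$$

Combining this with the formula for $\hat{g}$ above, we conclude

$$[\hat{F}(x_1), \ldots, \hat{F}(x_{N+M})]^T = \arg\max_f\; p(y, f) ,$$

which completes the proof of Proposition 5. ■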



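C. Proof of Remark 6

Proof: It suffices to show that equations (6) and (7) hold.
It follows from $N = 1$, $\gamma = 1$, that $c$ is a scalar,
$\hat{f} = \hat{F}(x_1) = \hat{c}$, and using (3) we have

$$\hat{f} = \hat{c} = \arg\min_c \left(|1 - c| + c^2\right) = 1/2 .$$

It also follows that

$$p(y \mid f)\, p(f) \propto \exp(-f^2 - |1 - f|) .$$

The minimum variance estimate $E(f \mid y)$, and its difference
from the MAP estimate $\hat{f}$, are given by

$$E(f \mid y) - \hat{f} = \frac{1}{A}\int_{-\infty}^{1} (f - 1/2)\exp(-f^2 - 1 + f)\, df + \frac{1}{A}\int_{1}^{+\infty} (f - 1/2)\exp(-f^2 + 1 - f)\, df .$$

Multiplying both sides of the equation by $A$ and using the
change of variables $s = f - 1/2$, we obtain

$$\begin{aligned}
A\,(E(f \mid y) - \hat{f})
&= \int_{-\infty}^{1/2} s \exp[-(s + 1/2)^2 + s - 1/2]\, ds + \int_{1/2}^{+\infty} s \exp[-(s + 1/2)^2 - s + 1/2]\, ds \\
&= \int_{-\infty}^{1/2} s \exp(-s^2 - 3/4)\, ds + \int_{1/2}^{+\infty} s \exp(-s^2 - 2s + 1/4)\, ds \\
&= \int_{-\infty}^{-1/2} s \exp(-s^2 - 3/4)\, ds + \int_{1/2}^{+\infty} s \exp(-s^2 - 2s + 1/4)\, ds \\
&= \int_{1/2}^{+\infty} s \exp(-s^2 - 3/4)\,[\exp(1 - 2s) - 1]\, ds ,
\end{aligned}$$

where the third equality uses the fact that $s\exp(-s^2)$ is odd, so
its integral over $[-1/2, 1/2]$ vanishes, and the last equality follows
from the substitution $s \to -s$ in the first integral. Dividing by
$A$ gives (7).

This completes the proof of the remark.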